Benchmarking Strategy for Arabic Screen-Rendered Word Recognition

Authors

  • Fouad Slimane
  • Slim Kanoun
  • Jean Hennebert
  • Rolf Ingold
  • Adel M. Alimi
Abstract

This chapter presents a new benchmarking strategy for Arabic screen-based word recognition. First, we report on the creation of the new APTI (Arabic Printed Text Image) database, a large-scale resource for benchmarking open-vocabulary, multi-font, multi-size and multi-style word recognition systems in Arabic. Such systems take a text image as input and output the character string corresponding to the text in the image. The database addresses the challenges posed by the variability of the sizes, fonts and styles used to generate the images. It also focuses on low-resolution images, where anti-aliasing introduces noise on the characters to be recognized. The database contains 45,313,600 single-word images totalling more than 250 million characters. Ground truth annotation is provided for each image in an XML file. The annotation includes the number of characters, the number of pieces of Arabic words (PAWs), the sequence of characters, and the size, style and font used to generate each image, among other fields. Second, we describe the Arabic Recognition Competition: Multi-Font Multi-Size Digitally Represented Text, held in the context of the 11th International Conference on Document Analysis and Recognition (ICDAR’2011), September 18–21, 2011, Beijing, China. This first edition of the competition used the freely available APTI database. Two groups with three systems participated. The systems were tested on a single test dataset unknown to all participants (set 6 of the APTI database) and compared on the most important characteristic of classification systems, the recognition rate, measured at the character and word levels. A short description of the participating groups, their systems, the experimental setup and the observed results is presented. Third, we present our DIVA-REGIM system (out of competition at ICDAR’2011), together with its results on all protocols of the Arabic recognition competition.

Author affiliations:

  • F. Slimane, J. Hennebert, R. Ingold: DIVA Group, Department of Informatics, University of Fribourg, Bd. de Perolles 90, 1700 Fribourg, Switzerland
  • S. Kanoun: National School of Engineers (ENIS), University of Sfax, BP 1173, Sfax 3038, Tunisia
  • A.M. Alimi: REGIM Group, National School of Engineers (ENIS), University of Sfax, BP 1173, Sfax 3038, Tunisia

Published in: V. Märgner, H. El Abed (eds.), Guide to OCR for Arabic Scripts, DOI 10.1007/978-1-4471-4072-6_18, © Springer-Verlag London 2012
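The abstract states that systems were compared by recognition rate at the character and word levels but does not give the formulas. The following is a minimal Python sketch of one common way to compute such rates, assuming the character rate is derived from the Levenshtein edit distance between each recognized string and its ground truth, and the word rate from exact string matches. The function names `edit_distance` and `recognition_rates` are illustrative and are not taken from the chapter or the competition protocol.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def recognition_rates(references, hypotheses):
    """Return (character_rate, word_rate) over paired word strings.

    Assumed definitions (not confirmed by the source):
      character_rate = 1 - total_edit_errors / total_reference_characters
      word_rate      = exactly_matching_words / number_of_words
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    correct_words = sum(r == h for r, h in zip(references, hypotheses))
    # Clamp at zero: many insertions can push errors above total_chars.
    char_rate = max(0.0, 1.0 - errors / total_chars)
    word_rate = correct_words / len(references)
    return char_rate, word_rate


if __name__ == "__main__":
    # Hypothetical ground truth and recognizer output for two word images.
    refs = ["كتاب", "قلم"]
    hyps = ["كتاب", "علم"]
    char_rate, word_rate = recognition_rates(refs, hyps)
    print(f"character rate: {char_rate:.3f}, word rate: {word_rate:.3f}")
```

Python strings are sequences of Unicode code points, so each Arabic letter counts as one character here. A real evaluation would also need to decide how to treat diacritics and presentation forms, which this sketch does not address.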

Related resources

Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

In order to facilitate the entry of data into the computer and its digitalization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...

The Effect of Pictorial Flashcards on the Sight Word Recognition in Kindergartens

This was a quasi-experimental study because it involved training participants in two classes of pre-primary students about 5 to 6 years old. To this end, fifty students studying at Misagh School in Tabriz participated in the study. To ensure their homogeneity, the researcher administered a pre-test. Based on the results, 40 students were selected as the ...

Segmenting Arabic Handwritten Documents into Text lines and Words

In this paper, we present a method for segmenting Arabic handwritten documents into text lines and words. Text line segmentation is addressed by a well-known technique, the horizontal projection profile, in which autocorrelation is used to enhance the self similarity of this profile. This technique promotes the estimation of text line spacing. Word extraction is based on an adaptation of a know...

ASM Based Synthesis of Handwritten Arabic Text Pages

Document analysis tasks, such as text recognition, word spotting, or segmentation, are highly dependent on comprehensive and suitable databases for training and validation. However, their generation is expensive in terms of labor and time. As a matter of fact, there is a lack of such databases, which complicates research and development. This is especially true for the case of Arabic handwriting reco...

Arabic Character Recognition using Approximate Stroke Sequence

Arabic character recognition of handwriting is addressed. A novel approach to Arabic character recognition, based on statistical analysis of typical Arabic text, is presented. Results showed that the sub-word, rather than the word, is the basic pictorial block in the Arabic language. The method of approximate stroke sequence is applied for the recognition of some Arabic characters i...

Publication date: 2012